[% setvar title Unicode Combinatorix %]
<div id="archive-notice">
    <h3>This file is part of the Perl 6 Archive</h3>
    <p>To see what is currently happening visit <a href="http://www.perl6.org/">http://www.perl6.org/</a></p>
</div>
<div class='pod'>
<a name='TITLE'></a><h1>TITLE</h1>
<p>Unicode Combinatorix</p>
<a name='VERSION'></a><h1>VERSION</h1>
<pre>  Maintainer: Simon Cozens &lt;<a href='mailto:simon@brecon.co.uk'>simon@brecon.co.uk</a>&gt;
  Date: 25 Sep 2000
  Mailing List: <a href='mailto:perl6-internals@perl.org'>perl6-internals@perl.org</a>
  Number: 312
  Version: 1
  Status: Developing</pre>
<a name='ABSTRACT'></a><h1>ABSTRACT</h1>
<p>How and when Unicode is used in Perl 6. Its sister RFC &quot;When UTF8 Leaks
Out&quot; deals with how output is processed with respect to Unicode; this
RFC deals with input and processing.</p>
<a name='DESCRIPTION'></a><h1>DESCRIPTION</h1>
<p>Here is an proposed overview of Perl 6's Unicode handling, except for
all aspects of output.</p>
<p>Data comes into Perl in one of two methods: through a filehandle or
socket, where line disciplines apply, or &quot;any other method&quot;.</p>
<p>Data which comes in through a line discipline <b>must</b> be in UTF8, unless
<code>no unicode</code> is in force.</p>
<p>Examples of data which enter Perl from &quot;any other method&quot; are
environment variables, stuff coming out of backticks, and globs.
It's assumed that these will come into Perl as ISO8859-1, and will be
converted into UTF8 on entry. If the user wants to tell us that this
data won't be ISO8859-1, then they may say <code>use unicode::system 'discipline'</code>
where <code>'discipline'</code> is any level 2 processing module. The data will be
filtered through that module, and we will have UTF8 data at the end of
this.</p>
<p>OK, at this stage, everything inside Perl is in Unicode. The next thing
we need to do is to normalise it, and the RFC on normalisation covers
that: basically, we either do this on demand or immediately. The setting
of <code>unicode::exact</code> comes into play here.</p>
<p>Comparisons can be carried out in three ways: if
<code>unicode::representation</code> is in force, then the bytes must be exactly
equal. If <code>unicode::exact</code> is set, a normalised copy of the operands is
made, and they are compared. Otherwise, the operands are normalised and
compared. As the FAQ says, &quot;Canonical equivalence matters&quot;.</p>
<p>Collation should take place according to the Unicode collation tables;
if <code>use locale</code> is set, then the collation is localised as well. The
Unicode locales RFC suggests other areas affected by locales, such as
word and line breaking and Unicode character classes.</p>
<p><code>no unicode</code> just throws everything. None of the above happens.</p>
<a name='IMPLEMENTATION'></a><h1>IMPLEMENTATION</h1>
<p>Just leave that to me...</p>
<a name='REFERENCES'></a><h1>REFERENCES</h1>
<p>&quot;What Level of Support Should I Look For?&quot;, <a href='http://www.unicode.org/unicode/faq' target='_blank'>www.unicode.org</a></p>
<p>RFC 311: Line Disciplines</p>
<p>RFC 295 Normalisation and <code>unicode::exact</code></p>
<p>RFC 300 <code>use unicode::representation</code> and <code>no unicode</code></p>
<p>RFC ??: When UTF8 Leaks Out</p>
<p>RFC ??: Abstract Internals String Interaction</p>
<p>RFC ??: Unicode Locales</p>
</div>
